Visual Representation Learning with Limited Supervision
The quality of a computer vision system depends on the rigor of the data representation it is built upon. Learning expressive representations of images is therefore central to almost every computer vision application, including image search, object detection and classification, human re-identification, object tracking, pose understanding, image-to-image translation, and embodied agent navigation, to name a few. Deep neural networks are the most common modern approach to representation learning. Their limitation, however, is that deep representation learning methods require extremely large amounts of manually labeled data for training. Annotating vast numbers of images for various environments is infeasible due to cost and time constraints. This need for labeled data is a prime restriction on the pace of development of visual recognition systems.
In order to cope with the exponentially growing amounts of visual data generated daily, machine learning algorithms have to at least strive to scale at a similar rate.
The second challenge is that the learned representations have to generalize to novel objects, classes, environments and tasks in order to accommodate the diversity of the visual world.
Despite the ever-growing number of recent publications tangentially addressing the topic of learning generalizable representations, efficient generalization is yet to be achieved. This dissertation attempts to tackle the problem of learning visual representations that can generalize to novel settings while requiring few labeled examples.
In this research, we study the limitations of the existing supervised representation learning approaches and propose a framework that improves the generalization of learned features by exploiting visual similarities between images which are not captured by the provided manual annotations. Furthermore, to mitigate the common requirement of large-scale manually annotated datasets, we propose several approaches that can learn expressive representations without human-attributed labels, in a self-supervised fashion, by grouping highly similar samples into surrogate classes based on progressively learned representations.
The development of computer vision as a science is preconditioned upon the ability of a machine to record and disentangle attributes of pictures that were once expected to be perceived only by humans. As such, particular interest has been dedicated to analyzing the means of artistic expression and style, which poses a more complex task than merely breaking an image down into colors and pixels. The ultimate test of this ability is style transfer, which involves altering the style of an image while keeping its content. An effective solution to style transfer requires learning an image representation that allows disentangling the style of an image from its content.
Moreover, particular artistic styles come with idiosyncrasies that affect which content details should be preserved and which discarded.
Another difficulty is that pixel-wise annotations of style, and of how the style should be altered, are impossible to obtain.
We address this problem by proposing an unsupervised approach that encodes the image content in the way a particular style requires.
The proposed approach exchanges the style of an input image by first extracting the content representation in a style-aware way and then rendering it in a new style using a style-specific decoder network, achieving compelling results in image and video stylization.
Finally, we combine supervised and self-supervised representation learning techniques for the task of human and animal pose understanding. The proposed method enables transfer of the representation learned for recognition of human poses to proximal mammal species without using labeled animal images. This approach is not limited to dense pose estimation and could potentially enable autonomous agents, from robots to self-driving cars, to retrain themselves and adapt to novel environments by learning from previous experiences.
Deep Unsupervised Similarity Learning using Partially Ordered Sets
Unsupervised learning of visual similarities is of paramount importance to
computer vision, particularly due to the lack of training data for fine-grained
similarities. Deep learning of similarities is often based on relationships
between pairs or triplets of samples. Many of these relations are unreliable
and mutually contradicting, implying inconsistencies when trained without
supervision information that relates different tuples or triplets to each
other. To overcome this problem, we use local estimates of reliable
(dis-)similarities to initially group samples into compact surrogate classes
and use local partial orders of samples to classes to link classes to each
other. Similarity learning is then formulated as a partial ordering task with
soft correspondences of all samples to classes. Adopting a strategy of
self-supervision, a CNN is trained to optimally represent samples in a mutually
consistent manner while updating the classes. The similarity learning and
grouping procedure are integrated in a single model and optimized jointly. The
proposed unsupervised approach shows competitive performance on detailed pose
estimation and object classification.
Comment: Accepted for publication at IEEE Computer Vision and Pattern
Recognition 201
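The initial grouping step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine-similarity measure, the threshold and the maximum group size are all assumptions chosen for illustration. Each sample is joined only with neighbours whose similarity is high enough to be considered reliable, and samples without reliable neighbours are left unassigned.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_surrogate_classes(features, sim_threshold=0.9, max_size=3):
    """Greedily group samples into compact surrogate classes (hypothetical helper).

    Only neighbours with similarity >= sim_threshold count as reliable.
    Samples with no reliable neighbour stay unassigned.
    """
    unassigned = set(range(len(features)))
    classes = []
    for seed in range(len(features)):
        if seed not in unassigned:
            continue
        # Rank the remaining samples by similarity to the seed, most similar first.
        neighbours = sorted(
            ((cosine(features[seed], features[j]), j)
             for j in unassigned if j != seed),
            reverse=True,
        )
        reliable = [j for sim, j in neighbours if sim >= sim_threshold]
        members = [seed] + reliable[: max_size - 1]
        if len(members) > 1:
            classes.append(members)
            unassigned -= set(members)
    return classes


# Toy features: two tight pairs and one outlier.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99], [-1.0, -1.0]]
print(build_surrogate_classes(features))
```

In a self-supervised loop of the kind the abstract describes, the resulting groups would serve as training labels, and the grouping would be recomputed as the learned features improve. That loop is omitted here.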
Discovering Relationships between Object Categories via Universal Canonical Maps
We tackle the problem of learning the geometry of multiple categories of
deformable objects jointly. Recent work has shown that it is possible to learn
a unified dense pose predictor for several categories of related objects.
However, training such models requires initializing inter-category
correspondences by hand. This is suboptimal and the resulting models fail to
maintain correct correspondences as individual categories are learned. In this
paper, we show that improved correspondences can be learned automatically as a
natural byproduct of learning category-specific dense pose predictors. To do
this, we express correspondences between different categories and between
images and categories using a unified embedding. Then, we use the latter to
enforce two constraints: symmetric inter-category cycle consistency and a new
asymmetric image-to-category cycle consistency. Without any manual annotations
for the inter-category correspondences, we obtain state-of-the-art alignment
results, outperforming dedicated methods for matching 3D shapes. Moreover, the
new model is also better at the task of dense pose prediction than prior work.
Comment: Accepted at CVPR 2021; Project page:
https://gdude.de/discovering-3d-obj-re
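The idea of cycle consistency in a shared embedding can be sketched in simplified form. This is an assumption-laden illustration, not the paper's formulation: it uses hard nearest-neighbour matching with squared Euclidean distance, and the helper names are hypothetical. A point on category A is mapped to its closest embedding on category B and then back to A; the cycle is consistent if it returns to the starting point.

```python
def nearest(query, candidates):
    """Index of the candidate embedding closest to query (squared Euclidean distance)."""
    return min(
        range(len(candidates)),
        key=lambda j: sum((q - c) ** 2 for q, c in zip(query, candidates[j])),
    )


def cycle_errors(emb_a, emb_b):
    """For each point of A, return 0 if the A -> B -> A round trip comes back to it, else 1."""
    errors = []
    for i, embedding in enumerate(emb_a):
        j = nearest(embedding, emb_b)        # A -> B
        i_back = nearest(emb_b[j], emb_a)    # B -> A
        errors.append(0 if i_back == i else 1)
    return errors


# Toy embeddings: the third point of A has no good counterpart in B.
emb_a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
emb_b = [[0.1, 0.0], [1.1, 0.0], [5.0, 0.0]]
print(cycle_errors(emb_a, emb_b))
```

A trainable version would need a differentiable penalty in place of the 0/1 indicator. The abstract does not specify one, so it is left out.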
Transferring Dense Pose to Proximal Animal Classes
Recent contributions have demonstrated that it is possible to recognize the
pose of humans densely and accurately given a large dataset of poses annotated
in detail. In principle, the same approach could be extended to any animal
class, but the effort required for collecting new annotations for each case
makes this strategy impractical, despite important applications in natural
conservation, science and business. We show that, at least for proximal animal
classes such as chimpanzees, it is possible to transfer the knowledge existing
in dense pose recognition for humans, as well as in more general object
detectors and segmenters, to the problem of dense pose recognition in other
classes. We do this by (1) establishing a DensePose model for the new animal
which is also geometrically aligned to humans, (2) introducing a multi-head
R-CNN architecture that facilitates transfer of multiple recognition tasks
between classes, (3) finding which combination of known classes can be
transferred most effectively to the new animal and (4) using self-calibrated
uncertainty heads to generate pseudo-labels graded by quality for training a
model for this class. We also introduce two benchmark datasets labelled in the
manner of DensePose for the class chimpanzee and use them to evaluate our
approach, showing excellent transfer learning performance.
Comment: Accepted at CVPR 2020; Project page:
https://asanakoy.github.io/densepose-evolutio
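Step (4) above, grading pseudo-labels by quality, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's procedure: each prediction is assumed to carry a scalar `uncertainty` produced by the model, and the `keep_fraction` parameter and function name are hypothetical.

```python
def select_pseudo_labels(predictions, keep_fraction=0.5):
    """Keep the most confident predictions as pseudo-labels (hypothetical helper).

    predictions: list of dicts with at least an "uncertainty" key (lower is better).
    keep_fraction: share of predictions to retain, an assumed tuning parameter.
    """
    ranked = sorted(predictions, key=lambda p: p["uncertainty"])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy detections on unlabeled animal images; the values are made up for illustration.
predictions = [
    {"image": "a.jpg", "uncertainty": 0.40},
    {"image": "b.jpg", "uncertainty": 0.05},
    {"image": "c.jpg", "uncertainty": 0.90},
    {"image": "d.jpg", "uncertainty": 0.10},
]
selected = select_pseudo_labels(predictions, keep_fraction=0.5)
print([p["image"] for p in selected])
```

The retained predictions would then be used as training targets for the new class, while the discarded ones are ignored or revisited after retraining.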
Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model
With the recent surge in popularity of AR/VR applications, realistic and
accurate control of 3D full-body avatars has become a highly demanded feature.
A particular challenge is that only a sparse tracking signal is available from
standalone HMDs (Head Mounted Devices), often limited to tracking the user's
head and wrists. While this signal is informative for reconstructing the upper
body motion, the lower body is not tracked and must be synthesized from the
limited information provided by the upper body joints. In this paper, we
present AGRoL, a novel conditional diffusion model specifically designed to
track full bodies given sparse upper-body tracking signals. Our model is based
on a simple multi-layer perceptron (MLP) architecture and a novel conditioning
scheme for motion data. It can predict accurate and smooth full-body motion,
particularly the challenging lower body movement. Unlike common diffusion
architectures, our compact architecture can run in real-time, making it
suitable for online body-tracking applications. We train and evaluate our model
on the AMASS motion capture dataset, and demonstrate that our approach outperforms
state-of-the-art methods in generated motion accuracy and smoothness. We
further justify our design choices through extensive experiments and ablation
studies.
Comment: CVPR 2023, project page: https://dulucas.github.io/agrol